Does OLS (the linear probability model) work sufficiently well?
How can we build a maximum likelihood model?
The Linear Probability Model
The main justification for the LPM is that OLS is unbiased (under the Gauss-Markov assumptions). But \(\ldots\)
predictions are nonsensical (linear, unbounded, measures of \(\hat{y}\) rather than \(y^*\)). Despite the (mis)named Linear Probability Model, the predicted quantities are not probabilities.
disturbances are non-normal and heteroskedastic - this threatens inference.
the mapping from \(x\beta\) to \(y\) has the wrong functional form (linear).
Example - Democratic Peace data
As a running example, I’ll use the Democratic Peace data to estimate logit and probit models. These come from Oneal and Russett (1997)’s well-known study in ISQ. The units are dyad-years; the \(y\) variable is the presence or absence of a militarized dispute, and the \(x\) variables include a measure of democracy (the lowest of the two Polity scores in the dyad), and a set of controls. The principal expectation here is that as the lowest democracy score in the dyad increases, the probability of a militarized dispute decreases.
Predictions out of bounds
This figure plots the predictions from a logit and OLS model. Unsurprisingly, the logit predictions are probabilities, so they lie in the \([0,1]\) interval. The OLS predictions are not, and are often out of bounds.
Here’s the distribution of predictions from the OLS model - you’ll note the modal density is around .04 (the sample frequency of \(y\)), but that a substantial, long tail is negative, and so out of probability bounds.
ggplot(df, aes(x = olspreds)) +
  geom_density(alpha = .5) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(title = "Density of OLS Predictions", x = "Predictions", y = "Density") +
  theme_minimal()
Heteroskedastic Residuals
The residuals from the OLS model appear heteroskedastic, and their distribution is not normal. In fact, the distribution appears bimodal, clustered around zero and one. This shouldn’t be surprising since the \(y\) variable only takes on values of zero and one, and since we compute the residuals as \(u = y - \hat{y}\).
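A quick way to see why: for binary \(y\), the residual at any \(x\) can take only two values, \(-\hat{y}\) or \(1-\hat{y}\), so its variance necessarily depends on \(x\). Here's a minimal sketch using simulated data (not the dyad data):

```r
# Sketch: LPM residuals are two-valued at any x, so Var(u) varies with x.
# Simulated data for illustration only.
set.seed(7)
x <- runif(1000)
y <- rbinom(1000, 1, 0.1 + 0.8 * x)   # true probabilities inside (0,1)
ols <- lm(y ~ x)
u <- resid(ols)

# At each x, the residual is either -yhat or 1 - yhat,
# so adding yhat back recovers only the values 0 and 1:
yhat <- fitted(ols)
sort(unique(round(u + yhat, 10)))     # -> 0 1

# Var(u | x) = p(x)[1 - p(x)] is not constant => heteroskedasticity
```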
There seems to be a mild trend in the discipline to rehabilitate the LPM, though it’s not clear why: it’s hard to find statements about the advantages of the LPM in any particular setting, or about disadvantages of estimating a logit or probit model, that would lead us to prefer the LPM.
OLS is a rockin’ estimator, but it’s just not well suited to limited \(y\) variables. Efforts to rehabilitate the LPM are like putting lipstick on a pig.
Limited Dependent Variables
Limited \(y\) variables are \(y\) variables where our measurement is limited by the realities of the world. Such variables are rarely normal, often not continuous, and often observable indicators of unobservable things - this is all true of binary variables.
Limited dependent variables are usually limited in the sense that we cannot observe the range of the variable or the characteristic of the variable we want to observe. We are limited to observing \(y_i\), and so must estimate \(y_i^*\).
Important Concept
It’s important to note that the problem is our data, not the OLS linear model. That model is simple and powerful. Our data are incomplete or limited realizations of the things we are interested in studying; those limitations push us away from the linear model.
unordered or nominal categorical variables: type of car you prefer: Honda, Toyota, Ford, Buick; policy choices; party or candidate choices.
ordered variables that take on few values: some survey responses.
discrete count variables; number of episodes of scarring torture in a country-year, 0, 1, 2, 3, …, \(\infty\); the number of flawed computer chips produced in a factory in a shift; the number of times a person has been arrested; the number of self-reported extramarital affairs; number of visits to your primary care doctor.
time to failure; how long a civil war lasts; how long a patient survives disease; how long a leader survives in office.
Binary \(y\) variables
Generally, we think of a binary variable as being the observable manifestation of some latent, unobserved continuous variable.
If we could adequately observe (and measure) the underlying continuous variable, we’d use some form of OLS regression to analyze that variable. But because we have limited observation, we turn to maximum likelihood methods to estimate a model that allows us to use \(y\), but generate estimates of \(y^*\), the variable we wish we could measure.
Building a model for maximum likelihood
So let’s build a model for a binary \(y\) variable. In general, here’s how we’re going to approach this task with any limited \(y\) variable, binary or otherwise: we’ll take this same approach to write likelihood functions for count variables, ordered variables, etc.
Observe \(y\), consider its distribution, write down the PDF.
Write the joint probability of the data, using the chosen probability distribution.
Write the joint probability as a likelihood.
Simplify - take logs, etc.
Parameterize
Write in the link function, linking the systematic component of the model to the latent variable, \(y^*\); we’re mapping the linear prediction onto the probability space.
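Applied to a Bernoulli \(y\) with a generic link \(F\), the steps above sketch out like this (setting \(F=\Phi\) gives the probit LLF; \(F=\Lambda\) gives the logit):

```latex
f(y_i \mid \pi_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i}
  % step 1: the PDF for a single binary observation
\Pr(y \mid \pi) = \prod_{i=1}^{N} \pi_i^{y_i}(1-\pi_i)^{1-y_i}
  % step 2: joint probability of the data (iid)
\mathcal{L}(\pi \mid y) = \prod_{i=1}^{N} \pi_i^{y_i}(1-\pi_i)^{1-y_i}
  % step 3: reread the joint probability as a likelihood
\ln \mathcal{L} = \sum_{i=1}^{N} y_i \ln \pi_i + (1-y_i)\ln(1-\pi_i)
  % step 4: simplify - take logs
\pi_i = F(\mathbf{x_i \beta})
  % steps 5-6: parameterize via the link function
\ln \mathcal{L}(\beta \mid y) = \sum_{i=1}^{N} y_i \ln F(\mathbf{x_i \beta}) + (1-y_i)\ln[1-F(\mathbf{x_i \beta})]
```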
A nonlinear model for binary data
So \(y\) is binary, and we’ve established the linear model is not appropriate. The observed variable, \(y\), appears to be Bernoulli (iid) - a binomial with a single trial:
Parameterize \(\pi_i\) - make \(\pi_i\) a function of some variables and their slope effects, \(x\beta\) - this is the systematic component of the model:
Note the sigmoid functions approach their limits at decreasing rates; the fastest rate of change occurs where \(Pr(y=1)=.5\), a point around which the curves are symmetric. That point is also the transition point: below it we’d predict a zero, above it a one, if we were interested in classifying cases into zeros and ones. Classification is a common use for models like these, say distinguishing spam from non-spam emails, or predicting the presence or absence of a disease. More on this later.
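The two sigmoid links are easy to draw directly; a minimal base-R sketch (no data required):

```r
# Sketch: plot the logit and probit link functions over the linear predictor
xb <- seq(-4, 4, by = 0.1)        # a grid of linear predictor values
p_logit  <- plogis(xb)            # logistic CDF: 1 / (1 + exp(-xb))
p_probit <- pnorm(xb)             # standard normal CDF

plot(xb, p_logit, type = "l", ylim = c(0, 1),
     xlab = "x*beta", ylab = "Pr(y = 1)", main = "Logit and Probit Links")
lines(xb, p_probit, lty = 2)
abline(h = 0.5, v = 0, col = "gray")   # both curves pass through (0, .5)
legend("topleft", legend = c("logit", "probit"), lty = c(1, 2))
```

Both curves cross \(Pr(y=1)=.5\) at \(x\beta=0\), where their slopes are steepest.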
Probit and Logit LLFs
Probit - link between \(x\hat{\beta}\) and \(Pr(y=1)\) is standard normal CDF: \[
\ln \mathcal{L} (Y|\beta) = \sum_{i=1}^{N} y_i \ln \Phi(\mathbf{x_i \beta})+ (1-y_i) \ln[1-\Phi(\mathbf{x_i \beta})] \nonumber
\]
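This log-likelihood can be written as an R function and maximized directly with `optim()`; a sketch using simulated data (not the dyad data), checked against `glm()`'s probit estimates:

```r
# Sketch: code the probit log-likelihood and maximize it numerically.
# Data are simulated for illustration; true beta = (-1, 0.8).
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-1 + 0.8 * x))

probit_llf <- function(beta, y, X) {
  xb <- X %*% beta
  # log Phi(xb) and log[1 - Phi(xb)] = log Phi(-xb), computed stably
  sum(y * pnorm(xb, log.p = TRUE) + (1 - y) * pnorm(-xb, log.p = TRUE))
}

X <- cbind(1, x)
fit <- optim(c(0, 0), probit_llf, y = y, X = X,
             control = list(fnscale = -1), hessian = TRUE)  # maximize

# Compare with glm's built-in probit
glm_fit <- glm(y ~ x, family = binomial(link = "probit"))
cbind(optim = fit$par, glm = coef(glm_fit))
```

The two sets of estimates agree to several decimal places; the negative inverse Hessian gives the usual variance-covariance matrix.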
Signs and significance - all the usual rules apply.
Quantities of interest - most commonly \(Pr(y=1|X)\); or marginal effects.
Measures of uncertainty (e.g. confidence intervals) are a must (as always).
For details on prediction approaches, methods, and code, please see the predictions notes.
Predicted probabilities
In the nonlinear model, the most basic quantity is
\[F(x\widehat{\beta})\]
where \(F\) is the link function, mapping the linear prediction onto the probability space.
For the logit, the predicted probability is
\[Pr(y=1) = \frac{1}{1+exp(-x\widehat{\beta})}\]
For the probit, the predicted probability is
\[Pr(y=1) = \Phi(x\widehat{\beta})\]
Again, simply using the link function to map the linear prediction onto the probability space.
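These formulas can be verified by computing the predicted probabilities "by hand" from the linear predictor; the model and data below are illustrative (simulated), not the dyad data:

```r
# Sketch: map x*beta-hat through the link function by hand and
# confirm it matches the fitted probabilities from glm().
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.5 * d$x))

m_logit  <- glm(y ~ x, data = d, family = binomial(link = "logit"))
m_probit <- glm(y ~ x, data = d, family = binomial(link = "probit"))

# Logit: Pr(y=1) = 1 / (1 + exp(-x*beta))
xb_l   <- model.matrix(m_logit) %*% coef(m_logit)
p_hand <- 1 / (1 + exp(-xb_l))
all.equal(as.vector(p_hand), unname(fitted(m_logit)))      # TRUE

# Probit: Pr(y=1) = Phi(x*beta)
xb_p <- model.matrix(m_probit) %*% coef(m_probit)
all.equal(as.vector(pnorm(xb_p)), unname(fitted(m_probit)))  # TRUE
```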
Marginal Effects
In the linear model, the marginal effect of \(x\) is \(\widehat{\beta}\). That is, the effect of a one unit change in \(x\) on \(y\) is \(\widehat{\beta}\).
The marginal effect is constant with respect to \(x_k\). Take a look:
x <- seq(0, 10, .1)
y <- 2 * x
df <- data.frame(x = x, y = y)
ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  labs(title = "Marginal Effect of x on y", x = "x", y = "y") +
  theme_minimal() +
  annotate("text", x = 5, y = 15, label = "y = 2x", color = "black") +
  geom_segment(aes(x = 5, xend = 5, y = 0, yend = 10), color = "red") +
  geom_segment(aes(x = 10, xend = 10, y = 0, yend = 20), color = "red")
The effect of \(x\) on \(y\) is 2 - it’s the same at \(x=5\) and at \(x=10\).
In the nonlinear model, the marginal effect of \(x_k\) depends on where \(x\widehat{\beta}\) lies with respect to the probability distribution \(F(\cdot)\).
This is \(\widehat{\beta}\) weighted by or measured at the ordinate on the PDF - the ordinate is the height of the PDF associated with a value of the \(x\) axis.
Important Concept
The effect of \(x\) on \(Pr(y=1)\) is not constant; it will be large for some values of \(x\) and small for others. This makes sense if we think about the sigmoid functions - the slope of the curve is steepest where \(Pr(y=1)=.5\), and flattens as we move away from that point toward either limit. Take another look at Figure 1.
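For the probit, the marginal effect is \(\phi(x\widehat{\beta})\widehat{\beta}\) - the ordinate of the standard normal PDF times the coefficient. A sketch with illustrative values (not the dyad model):

```r
# Sketch: the probit marginal effect dnorm(x*beta) * beta varies with x.
# Coefficients here are made up for illustration.
beta0 <- -1
beta1 <- 0.8
x  <- c(-2, 0, 1.25, 4)          # x = 1.25 puts x*beta at 0, the steepest point
xb <- beta0 + beta1 * x
me <- dnorm(xb) * beta1          # marginal effect of x at each value

round(cbind(x, xb, me), 4)
# me peaks where xb = 0 (Pr(y=1) = .5) and shrinks toward zero in the tails
```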
Logit Marginal Effects
Recall \(\Lambda\) is the logistic CDF: \[\Lambda(x_i\widehat{\beta}) = \frac{1}{1+exp(-x_i\widehat{\beta})}\]
\(\lambda\) is the logistic PDF: \[\lambda(x_i\widehat{\beta}) = \frac{exp(-x_i\widehat{\beta})}{[1+exp(-x_i\widehat{\beta})]^2} = \Lambda(x_i\widehat{\beta})[1-\Lambda(x_i\widehat{\beta})]\]
So this last line indicates the marginal effect of \(x\) is the probability of a one times the probability of a zero times \(\widehat{\beta}\).
This is useful because the largest value \(Pr(y_i=1) \cdot Pr(y_i=0)\) can take on is \(0.5 \cdot 0.5 = 0.25\) - therefore, the maximum marginal effect any \(x\) can have is \(0.25 \widehat{\beta}\).
Looking at the democratic peace model below, the coefficient on democracy is -.071, so the largest effect democracy can have on the probability of a militarized dispute is \(0.25 \cdot -.071 = -.01775\).
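The 0.25 bound and the democracy calculation are easy to verify; a sketch using the \(-.071\) coefficient from the text:

```r
# Sketch: Lambda * (1 - Lambda) peaks at 0.25, when Pr(y=1) = .5
p <- plogis(seq(-5, 5, by = 0.01))   # predicted probabilities over a grid
max(p * (1 - p))                     # -> 0.25, attained at p = 0.5

# Largest possible marginal effect of democracy on Pr(dispute),
# using the coefficient reported in the text:
b_dem  <- -0.071
max_me <- 0.25 * b_dem
max_me                               # -> -0.01775
```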
The ordinate at the maximum of the standard normal PDF is 0.3989 - rounding to 0.4, we can say that the maximum marginal effect of any \(\widehat{\beta}\) in the probit model is \(0.4\widehat{\beta}\).
The ordinate is at its maximum where \(z=0\); recall this is the standard normal, so \(x_i\widehat{\beta}=z\). When \(z=0\), \(\phi(0) = 1/\sqrt{2\pi} \approx 0.3989\).
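A one-line check of the "0.4 rule" for the probit:

```r
# Sketch: the standard normal PDF's maximum ordinate is 1/sqrt(2*pi)
dnorm(0)          # -> 0.3989423
1 / sqrt(2 * pi)  # same value
# so the maximum probit marginal effect is roughly 0.4 * beta-hat
```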
Not only is it simple to exponentiate \(\widehat{\beta_k}\); the interpretation is that a one unit change in \(x_k\) multiplies the odds of \(y=1\) by the factor \(exp(\widehat{\beta_k})\), and more usefully, that:
\[
100*(exp(\widehat{\beta_k})-1) \nonumber
\]
is the percentage change in the odds given a one unit change in \(x_k\).
So a logit coefficient of .226
\[
100*(exp(.226)-1) =25.36 \nonumber
\]
produces a 25.36% increase in the odds of \(y\) occurring.
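The arithmetic for the .226 coefficient from the text:

```r
# Sketch: percentage change in the odds from a logit coefficient
b <- 0.226
exp(b)               # odds ratio for a one unit change in x
100 * (exp(b) - 1)   # ~ 25.36 percent increase in the odds
```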
Oneal, John R., and Bruce M. Russett. 1997. “The Classic Liberals Were Right: Democracy, Interdependence, and Conflict, 1950–1985.” International Studies Quarterly 41 (2): 267–94.